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1. Introduction 

In this paper we present a GPU implementation of list-mode reconstruc¬ 
tion algorithm of a 2D strip PET. This detector consists of two parallel bars 
(strips) of scintillator with a photomultiplier attached to each end [U [2]. 


L 



Fig. 1. 2D-strip detector geometry. 


By measuring the time of the arrivals of photons to each of the photomul¬ 
tipliers we can reconstruct the position at which 7 quanta have interacted 
with the scintillators as well as the position along the line-of-response (LOR) 
(see Figure [^. Application of the state of the art electronics developed at 
the Jagiellonian University allowed to achieve the required resolution 13 El- 

A double-strip prototype can be regarded as an elementary part of the 
full 3D “J-PET’j^ detector under construction at our faculty [II El 13 El [7j . 
The detector will consists of cylindrically arranged scintillator strips (as 
shown schematically in figure]^ enabling a full 3D reconstruction. However, 
the two strip prototype is also of interest as a cheap scanning device. 


^ http://koza.if.uj.edu.pl/pet/ 
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Fig. 2. An example of the possible 3D detector geometry of the J-PET detector. 


2. Setup 


The description of the readout system electronics is beyond the scope 
of this paper, we will just assume that for each event we are given three 
numbers {zu, Zd, A/) (see Fig. [^. By convention we use the tilde to denote 
measured quantities as opposite to ’’real” or exact values. The Zu and Zd 
denote respectively the reconstructed position along the upper and lower 
strip and A/ is the difference of the distances along the LOR from the 
emission point to the upper and lower strips 


M = 


yJ{R- eyY + {zu - e^)2 


R + eyY + {zd - e^)2. 


( 1 ) 
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From those measurements the emission position and angle can be recon¬ 
structed directly 


tan0 = 


2R 

1 Al 


2RAI 



^ Vl + tan^ 9 V— Zd + 

(^Zu + Zd + 2y tan 6^ 


+ 


(^Zu Z(i)^l 

VZ- Zd + 4-R2 


(2) 


It is however subject to measurement errors (see the correlation matrix 
description at the end of Section . In figure we present results of such 
direct reconstruction of the phantom depicted in the figure 4a It is clear 


that the resolution of the detector is not sufficient for direct reconstruction 
and statistical reconstruction methods need to be applied. 

The statistical reconstruction is done iteratively using the List-Mode 
version of the Maximal Likelihood Expectation Maximization (MLEM) al¬ 
gorithm. Each iteration of this algorithm defined by the following formula 


p(/)h+i) = 


^ T’(ejj/)p(/)W 

M 

E P{ej\i)s{i)p{{)(^') 

i=l 


( 3 ) 


The p{l) is the sought tracer emission density given as the average number of 
emissions from pixel I during the examination. The P(e\i) is a reconstruction 
kernel that represents the probability that an event originating in pixel i will 
be detected as e. The s{i) is the sensitivity of the pixel i i.e. the probability 
that an event emitted from pixel i will be detected at all. This sensitivity 
can be easily calculated from the geometry: 


s{y,z) = TT ^ ^arctan min ^ 


\L — z + z 
R-y'R+y 


— arctanmax — 


\L z — z 


R-y R+y 


( 4 ) 


In derivation have assumed the detection probability along the strip is con¬ 
stant and that it does not depend on the angle of incidence. This conditions 
should be approximately fulfilled for incidence angles not exceeding 30°. 
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The formula Q can be rewritten as 

E -P(ej|i)/9'(i)W 

2=1 


( 5 ) 


with 

p'{i) ^ s{i)p{i). (6) 

In the following we will give the results of the reconstruction of p'{i). 

The sum over j in ([^ runs over all collected events {ij}. Considering 
that up to hundred millions of events can be collected during a single scan 
this is a very time consuming calculation so the efficient calculation of the 
kernel P is essential. 


3. Kernel and correlation matrix 
In [9l[T0] we have found analytical approximation of P{e\i) given by 


P(e|i) 


det2 C 


27rVSC ^a + 2oC 

-.r 


J 


exp 


V 


bC-^b- 


\ 


aC ^a + 2oC 



The d, d, b are defined as follows 

/ — (Ay + y — i?) tan0cos“^ 0 


o = 


a = 


b = 


y—2Aycos ^6 

= y ~y = z — z. 


( 7 ) 


— (Ay + y + i?) tan0cos ‘^6 , 

(8) 

y—(Ay + ]pj cos“^ 0(1 + 2 tan^ 0) j 


^ -{Ay + y-R) cos“^ 0 \ 


-(Ay + y + R) cos“^ 0 , 

( 9 ) 

y—{Ay + jp) cos“^ 0tan oj 


(Az — Ay tan 


Az — Ay tan 0 

(10) 


and 


( 11 ) 
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The y and 2 are the coordinates of the reconstructed emission point and 
9 is the reconstructed emission angle of the event e. The y and 2 : are the 
coordinates of the center of pixel i. C is the correlation matrix which in 
general can be of the form: 




0 

7 ^ 

0 

1 

II 

0 

1 

-7 

1 


^7 

-7 


( 12 ) 


This matrix depends on the Zu and Zd- For cj^. Experimentally we have 
found this dependence to be quite weak on the order of 10 % from the center 
(lowest) to the edge (highest). We have found out that coefficient 7 can 
be neglected as long as we do not take into account events with z^^d) iiear 
the edge of the scintillators. This may change when we consider the full 
detector with longer (500mm) scintillator strips, but in this contribution we 
assume correlation matrix to be diagonal. Currently we achieve ~ 10 mm 
and (Ta; ~ 40mm. Please note that the last number corresponds to 20mm 
error for the position along the emission line as the distance from the line 
midpoint is equal to |AL 

Formula ([^ is, at least for the range of parameters we have studied, 
strongly dominated by the gaussian term bC~^b. This term defines an 3 cj 
ellipse (see figure [^. For practical purposes we can assume that the kernel 
is zero outside this ellipse. As it is easier to work with rectangular shapes we 
also define a bounding box consisting of an rectangle that is circumscribed 
on the ellipse (see Appendix [B|) . 


4. Implementation 

The iteration step described by formula Q can be implemented as de¬ 
scribed by the pseudocode in Listing 

Loops forCauto i : ellipse(e_j)) on lines 6 and 10 iterate over all 
pixels in the 3a ellipse of the event e. To calculate pixels contributing to 
this ellipse we first need to determine its bounding box in pixel space. Once 
bounding box is calculated we loop only trough pixels inside this bounding 
box. Each pixel is then tested if its center point resides inside or outside of 
the ellipse. Only then the whole kernel is calculated. The results are cached 
and used subsequently in the second loop. 


4 . 1 . CPU 

The CPU implementation follows essentially the algorithm from listing[^ 
We use OpenMP to parallelize the outer loop (line 5) over the events. Each 
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1 for (auto p_l : pixels) { 

2 rho_new[p_l] = 0.0; 

3 > 

4 for (auto e_j : events) { 

5 auto denominator = 0.0; 

6 for (auto i ; ellipse(e_j)) { 

7 kernel[i] = p(e_j, i); 

8 denominator += kernel[i] * rho[i]; 

9 } 

10 for (auto i ; ellipse(e_j)) { 

11 rho_new[i] += rho[1] * kernel[i] / denominator; 

12 } 

13 } 


Listing 1. Implementation of the reconstruction iteration routine. 


thread writes to its own copy of rho_new array which are added at the 
end of the iteration. Currently we do not take direct advantage of the 
AVX/SSE instruction set aside of automatic vectorization provided by Intel 
C++ Compiler. 


4-2. CPU implementation 

Next step was a naive GPU implementation based on our reference CPU 
implementation where each thread processes all pixels of single event, so few 
thousands of events are processed simultaneously by hardware threads. 

Such approach has however serious drawback on GPU hardware, which 
is essentially a vector computer. On the NVIDIA CUBA architecture that 
we use, the threads are collected in batches of 32 threads called warps. 
All threads in a warp must execute same instruction in parallel (SIMD). 
In the naive implementation each thread is processing a different events 
with different number of pixels. That amounts to a double loop with loops 
bounds different across the threads of a warp. This leads to severe thread 
divergence and as we have discovered carries a much higher penalty then 
naively expected. One would expect that the execution time of a warp, 
would be approximately the time needed to execute the longest loops, but as 
it turned out it is much higher. Additionally we cannot cache visited pixels 
and their kernel results since it is not enough registers or shared memory to 
store such information given each thread processes separate event. 

We can circumvent that using different pixel calculation scheduling where 
whole warp of 32 threads calculates a single event. This is called by us warp 
granularity (see Figure]^. As each thread in a warp process a single pixel 
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Fig. 3. Warp granularity (whole event processed by single warp) 


from the same event there is no divergence. Different events are processed 
by different warps which can run independently. This algorithm also lets us 
better leverage available shared memory and registers. However, processor 
cycles are still wasted by the threads that fail the bounding ellipse test. 

First it has to be noted that single event is calculated in two passes. 
First we need to calculate denominator of ([^. This pass needs bounding 
box to be calculated hrst, then each pixel in this pass is tested with 3-sigma 
ellipse equation. 

During first pass warp granularity gives us opportunity to cache visited 
pixels and kernel ([^ in shared memory and registers, so the second pass 
can loop only through visited already pixels without a need to test them 
for ellipse inclusion. Also we can cache kernel results in registers, since each 
thread in warp is likely to visit just few pixels of single event. 

Calculation of the denominator requires adding the contributions from 
the 32 threads of the warp. We have done this using the new shuffle instruc¬ 
tions introduced in Kepler architecture. This gave a notable performance 
boost over standard reduction algorithm using shared memory [TT]. 

Final optimization is to access p (previous iteration image buffer) as 
texture. This produces noticeable performance boost by using hardware 
GPU texture unit cache and special 2D access optimized memory layout. 
However it can be observed that memory access still takes around 35% of 





























•paper printed on August 10, 2015 


9 


overall iteration time after optimizations. 


5. Benchmarks and results 


We have benchmarked our GPU implementation on NVIDIA GeForce 
GTX 770 commodity card with 4GB memory and compute capability 3.0 
using CUDA SDK 6.5, CPU implementation on Intel Xeon CPU E5-1650 
v2 @ 3.50GHz with 6 cores using ICC 15.0.0 [Intel Composer XE 2015). 
The benchmark results are presented in Table while the results of recon¬ 
struction of the phantom after different number of iterations are presented 
in the figures 4c to[4f[ 


Number 
of Events 

CPU 

OpenMP 

GPU 

Thread 

GPU 

Warp 

Speedup 

CPU/Warp 

10 x10^ 

11.83 s 

0.69 

s 

0.47 

s 

25x 

20 x10® 

23.65 s 

1.38 

s 

0.93 

s 

25x 

30x10® 

35.46 s 

2.07 

s 

1.40 

s 

25x 

40x10® 

47.20 s 

2.75 

s 

1.86 

s 

25x 

50x10® 

58.99 s 

3.44 

s 

2.33 

s 

25x 

60x10® 

70.82 s 

4.13 

s 

2.80 

s 

25x 

70x10® 

82.73 s 

4.82 

s 

3.26 

s 

25x 

80x10® 

94.41 s 

5.50 

s 

3.73 

s 

25x 

90x10® 

106.15 s 

6.19 

s 

4.19 

s 

25x 

100 x10® 

118.04 s 

6.88 

s 

4.66 

s 

25x 


Table 1. Single iteration reconstruction time per number of events. 


6. Summary and outlook 


We have implemented and tested our reconstruction kernel on simulated 
data using realistic parameters obtained from experimental measurements. 
As seen from the figures 4c to the results are very encouraging, consid¬ 
ering the simplicity and the resolution of our setup. Implementing the re¬ 
construction algorithm on the commodity GPU provided a 25-fold speedup 
that allows real-time processing. One should note however that this speed is 
partly due to not taking advantage of the CPU vector AVX instruction set. 
The reason for this is that as we have already pointed out in |12j the CUDA 
and OpenCL programming model is inherently vectorized while CPU is still 
viewed as superscalar processor with vector instructions mixed in. This is 
only now slowly changing with introduction of new compiler pragmas to 
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deal explicitly with vectorization in a similar way as OpenMP deals with 
parallelization. 

In derivation of the Q we have assumed a very simple detector geom¬ 
etry with scintillators approximated by thin lines. In reality they have a 
rectangular cross section of 5x20mm^. To some extent this was taken into 
account by using the errors estimated from real scintillators. The model 
however must be validated on real data (which is already collected) and 
this is a subject of an ongoing investigation. 
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X 

y 

a 

b 

(t> 

P 

0 

0 

30 

60 

0 

0.3 

50 

-62 

10 

33 

-40 

0.3 

-50 

-63 

20 

33 

45 

0.5 

60 

65 

13 

14 

0 

0.5 

35 

55 

12 

12 

0 

0.7 

0 

0 

120 

no 

0 

0.1 


Table 2. Phantom description. 
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Appendix A 

Phantom 


Phantom definition is given in table Each row corresponds to an 
ellipse with center (x, y) the half-axes a and b rotated by angle (f> counter¬ 
clockwise. The p denotes the relative density of the tracer. When two 
ellipses overlap the p is taken from the topmost (in table) one. 


Appendix B 

Bounding box 


Given an ellipse defined by the equation 
Ay"^ + Cyz + Bz'^ = 

its bounding box is a rectangle with lower left corner at 


R 




(B.l) 


y = - 


z = — 


(B.2) 
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and symmetric upper right corner. This combined with 




crt 


a\iCOS^9 


= - 2AzAi/ 


2 tan0 


at 


at 


(B. 


^ 2/2 tan^ 9 


a\i cos‘^9^ 


allows us to calculate the bounding box of the 3 cj ellipse for each event. 
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(a) Ideal reconstruction 




(c) after 1 iteration 



(e) after 10 iterations 

Fig. 4. Phantom used in reconstruction, 
pixel size) 





(b) Direct reconstruction 



(d) after 5 iterations 



(f) after 25 iterations 
= 130mm, L = 300mm and 4 x 4mm 


